Abstract



The distribution N(x) of citations of scientific papers has recently been illustrated (on ISI
and PRD data sets) and analyzed by Redner [Eur. Phys. J. B 4, 131 (1998)]. To fit
the data, a stretched exponential ($N(x) \propto \exp[-(x/x_0)^{\beta}]$) has been used with only partial
success. The success is not complete because the data exhibit, for large citation count x, a
power law (roughly $N(x) \propto x^{-3}$ for the ISI data), which the stretched exponential clearly
does not reproduce. This fact is then attributed to a possibly different nature of rarely
cited and highly cited papers. We show here that, within a nonextensive thermostatistical
formalism, the same data can be quite satisfactorily fitted with a single curve, namely
$N(x) \propto 1/[1+(q-1)\lambda x]^{q/(q-1)}$, for the available values of x. This is consistent with the
connection recently established by Denisov [Phys. Lett. A 235, 447 (1997)] between this
nonextensive formalism and the Zipf-Mandelbrot law. What the present analysis ultimately
suggests is that, in contrast to Redner's conclusion, the phenomenon might essentially be
one and the same along the entire range of the citation number x.

Keywords: Citations; Nonextensive entropy; Zipf-Mandelbrot law; Complex phenomena.



Half a century ago, Zipf [1] made his remarkable observations about some basic linguistic
laws. More precisely, if we order the words appearing in a text (e.g., Homer's Iliad) from
the most to the least frequent, thus obtaining a ranking (low rank for the most used,
and high rank for the least used), we can plot, as a function of the rank, the number of times
those words appear. Zipf showed that, except for the words with extremely low rank, an
inverse power law emerges (the so-called Zipf's law). The exponent exhibits interesting universal
aspects. For instance, for the spoken language, it appears to be very sensitive to the degree
of instruction (primary, intermediate, highly academic) of the speaker, but very little to
the particular culture (French, German, Anglo-Saxon). Later on, Mandelbrot pointed out
connections of this phenomenon with fractals [2], and also suggested a further correction,
namely that substantially better fittings can be obtained by using an inverse power law
of the sum of the rank with a constant (the so-called Zipf-Mandelbrot law). A further step
along this line was provided recently by Denisov [3]. Indeed, by using, within the
Sinai-Bowen-Ruelle thermodynamical formalism for symbolic dynamics, the nonextensive thermostatistics
proposed some years ago by one of us [4], Denisov deduced the Zipf-Mandelbrot law. To be
more precise, it is clear that, unless one uses a specific model, there is no way to deduce
the precise values for the exponent and the additive constant. What Denisov deduced,
from very generic entropic arguments, was the form of the law. In this sense, the approach
is closely analogous to those which succeed in associating Gaussians with normal diffusion, and
Lévy or Student's t-distributions with anomalous superdiffusion (see, for instance, [5] and [6]
respectively). Finally, it is important to stress here that, although the present problem
was historically triggered in Linguistics, the same kind of considerations are equally relevant
to DNA sequences, artificial languages, and a variety of other stochastic, deterministic or
mixed processes.

Here we focus on an interesting analysis of data concerning citations of scientific
publications. More precisely, Redner [7] recently exhibited and discussed the distributions of
citations related to two quite large data sets, namely (i) 6 716 198 citations of 783 339
papers, published in 1981 and cited between 1981 and June 1997, that have been catalogued
by the Institute for Scientific Information (ISI), and (ii) 351 872 citations, as of June 1997,
of 24 296 papers cited at least once and which were published in Physical Review D (PRD)
in volumes 11 through 50 (1975-1994). In his study, Redner addressed the citations of
publications, at variance with Laherrère and Sornette [8], who addressed, in a similar study, the
citations of authors. Let us denote by x the number of citations and by N(x) the number of
papers that are cited x times. The main results of the study were that, for relatively large
values of x, $N(x) \propto 1/x^{\alpha}$ with $\alpha \approx 3$, whereas, for relatively small values of x, the data were
reasonably well fitted with a stretched exponential, i.e., $N(x) \propto \exp[-(x/x_0)^{\beta}]$, $\beta$ and $x_0$
being the fitting parameters ($\beta \approx 0.44$ and 0.39 for the ISI and the PRD data respectively);
see Figs. (a) and (b). Since a stretched exponential by no means asymptotically provides an
inverse power law, the author concluded that large x and low x phenomena are of different
nature (in the author's words, "These results provide evidence that the citation distribution
is not described by a single function over the entire range of citation count"). While the
phenomenon exhibited by Redner is of great interest, we must disagree with his conclusion.
It is the central purpose of our present effort to develop arguments within the nonextensive
statistical mechanics mentioned above, and along the lines of Denisov, which will lead to a
single function N(x) having, like the stretched exponential, only two fitting parameters. This
function is of the power-law type and will turn out to fit both ISI and PRD experimental
data appreciably better than the forms described above.
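
One way to make explicit why a stretched exponential cannot reproduce a power-law tail is to compare the local slopes of the two forms in a log-log plot. The short derivation below uses only the two functional forms quoted above, with the symbols $\alpha$, $\beta$ and $x_0$ as defined there:

```latex
\frac{d\ln N}{d\ln x} = -\beta\left(\frac{x}{x_0}\right)^{\beta}
\quad\text{for}\quad N(x)\propto \exp\!\left[-(x/x_0)^{\beta}\right],
\qquad
\frac{d\ln N}{d\ln x} = -\alpha
\quad\text{for}\quad N(x)\propto x^{-\alpha}.
```

For any $\beta > 0$ the first slope grows in magnitude without bound as x increases, whereas the second is constant, so no choice of $\beta$ and $x_0$ gives a straight line of slope close to -3 over an extended range of large x.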

Before presenting our arguments for this specific problem, let us briefly introduce the
nonextensive formalism we are referring to. If the physical system we are focusing on
involves long-range interactions or long-range microscopic memory or (multi)fractal boundary
conditions, it can exhibit a quite anomalous thermodynamic behavior, which might even be
untraceable within Boltzmann-Gibbs (BG) statistical mechanics. To overcome at least some
of these pathological situations, an entropic form $S_q$ has been proposed [4] which yields
a generalization of standard statistical mechanics and thermodynamics. This entropy is
defined as follows:

$$S_q = k\,\frac{1-\sum_i p_i^q}{q-1} \tag{1}$$

where k is a positive constant (from now on taken to be unity, without loss of generality).
In the limit $q \to 1$, we recover the usual BG entropy, i.e., $S_1 = -\sum_i p_i \ln p_i$. A property
which characterizes the above generalized entropic form is the following: if we have two
independent systems A and B such that $p_{ij}^{A+B} = p_i^A\,p_j^B$, then

$$S_q(A+B) = S_q(A) + S_q(B) + (1-q)\,S_q(A)\,S_q(B) \tag{2}$$

Consequently, if q > 1, q = 1 or q < 1, $S_q$ is subextensive, extensive or superextensive,
respectively. Optimization of this entropy with appropriate constraints provides equilibrium
distributions which are of the power-law type, and which recover the exponential Boltzmann
factor only in the limit $q \to 1$.
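
As a quick numerical illustration of Eqs. (1) and (2), the following sketch (Python with NumPy) evaluates $S_q$ with k = 1 for two small probability vectors and checks both the $q \to 1$ limit and the pseudo-additivity rule for independent systems. The function name `tsallis_entropy` and the probability values are illustrative assumptions, not quantities taken from the citation data.

```python
import numpy as np

def tsallis_entropy(p, q):
    """S_q = (1 - sum_i p_i^q) / (q - 1) with k = 1; BG entropy when q == 1."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # states with p_i = 0 do not contribute
    if q == 1.0:
        return float(-np.sum(p * np.log(p)))
    return float((1.0 - np.sum(p ** q)) / (q - 1.0))

# Made-up distributions for two independent systems A and B (assumption).
p_a = np.array([0.5, 0.3, 0.2])
p_b = np.array([0.6, 0.4])
p_ab = np.outer(p_a, p_b).ravel()  # joint probabilities p_ij = p_i^A * p_j^B

q = 1.53
s_a = tsallis_entropy(p_a, q)
s_b = tsallis_entropy(p_b, q)
lhs = tsallis_entropy(p_ab, q)              # S_q(A+B)
rhs = s_a + s_b + (1.0 - q) * s_a * s_b     # right-hand side of Eq. (2)

s_bg = tsallis_entropy(p_a, 1.0)            # Boltzmann-Gibbs value
s_near = tsallis_entropy(p_a, 1.0 + 1e-6)   # S_q just above q = 1

print(lhs, rhs)
print(s_bg, s_near)
```

Since q > 1 in this example, the cross term is negative and the entropy of the composed system is smaller than the sum of the parts, which is the subextensive case described above.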

This thermostatistics has provided interesting insights into a variety of physical systems
such as two-dimensional turbulence in pure-electron plasma [9], self-gravitating systems [10],
cosmology [11], solar neutrinos [12], Lévy [5] and correlated [6] anomalous diffusions, inverse
bremsstrahlung absorption in plasma [13], quantum scattering [14], one-dimensional maps
[15], a variety of self-organized critical models [16], long-range interaction conservative
systems [17], processing of EEG signals of epileptic humans and turtles [18], among others (see
[19] for a review). To theoretically study such complex systems within this nonextensive
formalism, some approaches (besides, naturally, the usual analytic and numerical methods)
are now available such as the generalizations of (i) Kubo's linear response theory, (ii)
Feynman's perturbation theory as well as the Bogoliubov inequality (basis of the variational
method), (iii) Green's functions, and (iv) Feynman's path integral (respectively generalized,
in the realm of nonextensivity, in [20]-[23]).

Let us now focus on our specific problem, namely the distributions of citations. The
corresponding entropic form is given by

$$S_q = \frac{1-\sum_{x=1}^{\infty} p_x^q}{q-1} \tag{3}$$

The optimization of this entropy with the corresponding constraints [24], namely

$$\sum_{x=1}^{\infty} p_x = 1 \tag{4}$$



and

$$\langle x \rangle_q \equiv \frac{\sum_{x=1}^{\infty} x\,p_x^q}{\sum_{x=1}^{\infty} p_x^q} = \text{constant}, \tag{5}$$



yields

$$p_q(x) = \frac{[1-(1-q)\lambda x]^{\frac{1}{1-q}}}{\sum_{y=1}^{\infty}[1-(1-q)\lambda y]^{\frac{1}{1-q}}} \tag{6}$$



where, unless q = 1, $\lambda$ is not (see [24]) the Lagrange parameter associated with constraint (5),
but can nevertheless be determined through that constraint. This distribution is expected
to be an excellent approximation for x not too small (say, not below 5), but departures
would not be surprising as x approaches unity. Indeed, we have deduced Eq. (6) through
generic entropic considerations and not by using a specific model. Also, for precisely the same
reason, q and $\lambda$ are to be considered as free parameters within the present phenomenological
approach.
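
To illustrate how $\lambda$ can in principle be obtained from constraint (5), the sketch below (Python with NumPy) computes the normalized q-expectation value $\langle x \rangle_q$ for the distribution of Eq. (6) and then inverts it by bisection. Several ingredients are assumptions made only for this illustration: the sum over x is truncated at a finite `x_max`, the value `lam_true = 0.1` is hypothetical, and the helper names `q_mean` and `solve_lambda` are not from the original work.

```python
import numpy as np

# Truncated support x = 1..x_max; the exact sums run to infinity (assumption).
X = np.arange(1, 200_001, dtype=float)

def q_mean(q, lam):
    """Normalized q-expectation <x>_q of Eq. (5) for the distribution of Eq. (6)."""
    # p_x^q is proportional to [1 + (q-1) lam x]^(-q/(q-1));
    # the normalization of p_x cancels in the ratio.
    w = (1.0 + (q - 1.0) * lam * X) ** (-q / (q - 1.0))
    return float(np.sum(X * w) / np.sum(w))

def solve_lambda(q, target, lo=1e-4, hi=1e2, iters=60):
    """Bisection on a log scale; <x>_q decreases as lam increases."""
    for _ in range(iters):
        mid = np.sqrt(lo * hi)
        if q_mean(q, mid) > target:
            lo = mid   # mean still too large: lam must be larger
        else:
            hi = mid
    return float(np.sqrt(lo * hi))

q = 1.53          # value quoted for the ISI series
lam_true = 0.1    # hypothetical value, for illustration only
target = q_mean(q, lam_true)
lam_rec = solve_lambda(q, target)
print(target, lam_rec)
```

In the phenomenological approach described here q and $\lambda$ are simply fitted to the data, so this inversion is only meant to show that fixing $\langle x \rangle_q$ fixes $\lambda$ once q is given.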

Eq. (6) implies that the so-called escort distribution is given by 

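$$P_q(x) \equiv \frac{[p_q(x)]^q}{\sum_{y=1}^{\infty}[p_q(y)]^q} = \frac{[1-(1-q)\lambda x]^{\frac{q}{1-q}}}{\sum_{y=1}^{\infty}[1-(1-q)\lambda y]^{\frac{q}{1-q}}} \propto \frac{1}{[1+(q-1)\lambda x]^{\frac{q}{q-1}}} \tag{7}$$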
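
The step from the distribution of Eq. (6) to its escort distribution can be checked numerically. The sketch below (Python with NumPy) builds $p_q(x)$ on a truncated support, raises it to the power q, renormalizes, and compares the result with the closed form $[1+(q-1)\lambda x]^{-q/(q-1)}$. The truncation at `x_max`, the value `lam = 0.1` and the function names are assumptions made for illustration only.

```python
import numpy as np

def q_distribution(q, lam, x_max=100_000):
    """p_q(x) of Eq. (6) on x = 1..x_max (finite truncation is an assumption)."""
    x = np.arange(1, x_max + 1, dtype=float)
    w = (1.0 + (q - 1.0) * lam * x) ** (-1.0 / (q - 1.0))
    return x, w / w.sum()

def escort(p, q):
    """Escort distribution P = p^q / sum(p^q)."""
    w = p ** q
    return w / w.sum()

q, lam = 1.53, 0.1   # lam is a hypothetical value
x, p = q_distribution(q, lam)
P = escort(p, q)

# Closed form, normalized to its value at x = 1 so that the
# (truncation-dependent) normalization constant drops out.
predicted = ((1.0 + (q - 1.0) * lam * x) / (1.0 + (q - 1.0) * lam)) ** (-q / (q - 1.0))
ratio = P / P[0]

print(np.max(np.abs(ratio / predicted - 1.0)))  # largest relative deviation
```

Comparing ratios to the value at x = 1 avoids any dependence on where the sum is cut off, since the normalization is the only place the truncation enters.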



This escort distribution is to be identified (see [24]) with the above introduced experimental
distribution N(x), hence

$$N(x) = N(1)\,\frac{[1+(q-1)\lambda]^{\frac{q}{q-1}}}{[1+(q-1)\lambda x]^{\frac{q}{q-1}}} \tag{8}$$

or, equivalently,

$$N(x) = \frac{N_0}{[1+(q-1)\lambda x]^{\frac{q}{q-1}}} \tag{9}$$



where we have simplified the notation by introducing $N_0 \equiv N(1)\,[1+(q-1)\lambda]^{q/(q-1)}$. The
fittings of both ISI and PRD data series using this functional form are exhibited in
Figs. (c) and (d). We can appreciate that they are considerably better (in both precision and
completeness) than those appearing in [7]. In particular, we have obtained, for the ISI series,
$q \approx 1.53$, hence $q/(q-1) \approx 2.89$, which is clearly compatible with the approximate exponent 3
advanced in [7].
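
The arithmetic behind the exponent, and the way a single curve interpolates between a flat region at small x and a power law at large x, can be sketched as follows (Python with NumPy). Only q = 1.53 comes from the text; the values of `lam` and `n0` are hypothetical placeholders, since the fitted values appear only in the figure, and the function name `citation_law` is an assumption.

```python
import numpy as np

def citation_law(x, q, lam, n0):
    """Single-curve form: N(x) = N0 / [1 + (q-1) lam x]^(q/(q-1))."""
    return n0 / (1.0 + (q - 1.0) * lam * x) ** (q / (q - 1.0))

q = 1.53              # ISI value quoted in the text
lam, n0 = 0.1, 1.0    # hypothetical values, for illustration only
exponent = q / (q - 1.0)   # 1.53 / 0.53, approximately 2.89

x = np.logspace(0, 6, 601)
n = citation_law(x, q, lam, n0)

# Local slope in a log-log plot: close to 0 for small x,
# tending to -q/(q-1) for large x.
slope = np.gradient(np.log(n), np.log(x))

# The same curve written in Zipf-Mandelbrot form, A / (x + c)^gamma.
c = 1.0 / ((q - 1.0) * lam)
zm = n0 * ((q - 1.0) * lam) ** (-exponent) / (x + c) ** exponent

print(exponent, slope[0], slope[-1])
```

The crossover between the two regimes sits near $x \sim 1/[(q-1)\lambda]$, which plays the role of the additive constant in the Zipf-Mandelbrot law.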

As a summarizing conclusion, we suggest that, at variance with what is stated in [7], the
present interesting linguistic-like phenomenon revealed by Redner appears to emerge from
one and the same reason for practically the entire range of citation score x. Furthermore, this
reason appears to be deeply related to thermostatistical nonextensivity. Specific microscopic
models are of course very welcome in order to achieve a more concrete insight, and also for
addressing the exceedingly small values of x, which are outside the scope of the present
phenomenological approach.

Finally, CNPq and PRONEX (Brazilian Agencies) are acknowledged for partial support.

Caption: ISI and PRD distributions of citations (experimental data and fittings). From
[7]: (a) log-linear plot and (b) log-log plot. Present work: (c) log-linear plot and (d) log-log
plot (Eq. (9) has been used with the values for $(q, \lambda, N_0)$ indicated in the figure).

This figure "marcio.gif" is available in "gif" format from:
http://arXiv.org/ps/cond-mat/9903433v1



